---
title: OpenAI Evals
description: Perform OpenAI Evals using UpTrain
---

**Overview**: In this example, we will see how to use UpTrain to run OpenAI evals. You can run any of the standard evals defined in the evals registry or create your own with custom prompts. We will use a Q&A task as an example to demonstrate both.

**Problem**: The workflow of our hypothetical Q&A application is as follows:
- The user enters a question.
- The query is converted to an embedding, and relevant sections from the documentation are retrieved using nearest neighbour search.
- The original query and the retrieved sections are passed to a language model (LM), along with a custom prompt to generate a response.  
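
The retrieval step in the workflow above can be sketched roughly as follows. This is only an illustration: the 3-dimensional vectors are made up, and `cosine_similarity` and `retrieve` are hypothetical helpers, not part of UpTrain or of the application that produced the dataset. A real system would get embeddings from a model and would likely use a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, doc_vecs, k=1):
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, invented for illustration only.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vec = [0.9, 0.1, 0.0]

print(retrieve(query_vec, doc_vecs, k=2))  # → [0, 2]
```

The sections at the returned indices would then be placed in the prompt together with the original question.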

Our goal is to evaluate the quality of the answer generated by the model using OpenAI evals.

**Solution**: We illustrate how to use the UpTrain Evals framework to assess the chatbot's performance. We will use a dataset built from logs generated by a chatbot made to answer questions from the [Streamlit user documentation](https://docs.streamlit.io/). For model grading, we will use GPT-3.5-turbo with the grading type ['coqa-closedqa-correct'](https://github.com/openai/evals/blob/main/evals/registry/modelgraded/closedqa.yaml), and we will also define our own custom grading prompt for the same task.

## Install required packages


```python
# !pip install uptrain
```


```python
import os

import polars as pl
```

Let's now load our dataset and see what it looks like.


```python
url = "https://uptrain-assets.s3.ap-south-1.amazonaws.com/data/qna-streamlit-docs.jsonl"
dataset_path = os.path.join(os.getcwd(), "qna-notebook-data.jsonl")

if not os.path.exists(dataset_path):
    import httpx

    r = httpx.get(url)
    with open(dataset_path, "wb") as f:
        f.write(r.content)

dataset = pl.read_ndjson(dataset_path).select(
    pl.col(["question", "document_text", "answer"])
)
print("Number of test cases: ", len(dataset))
print("Couple of samples: ", dataset[0:2])
```

![sample_dataset.png](https://github.com/uptrain-ai/uptrain/assets/43818888/6081b07a-4dbb-449f-b9ed-97260a9c86d2)

So, we have a bunch of questions, retrieved documents and final answers (i.e. LLM responses) for them. Let's evaluate the correctness of these answers using OpenAI evals.
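
Before running any evals, it can be worth checking that every record actually has the three fields used in this example. The snippet below is a minimal sketch using only the standard library; the two records are invented, and `parse_jsonl` and `missing_fields` are hypothetical helper names, not UpTrain functions.

```python
import json

# Column names taken from the dataset used in this example.
REQUIRED_FIELDS = ("question", "document_text", "answer")


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Two made-up records; the second one has an empty answer.
sample = "\n".join(
    [
        json.dumps({"question": "q1", "document_text": "d1", "answer": "a1"}),
        json.dumps({"question": "q2", "document_text": "d2", "answer": ""}),
    ]
)

records = parse_jsonl(sample)
problems = {
    i: missing_fields(r) for i, r in enumerate(records) if missing_fields(r)
}
print(problems)  # → {1: ['answer']}
```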

## Using UpTrain Framework to run evals

UpTrain provides integrations with openai evals to run any check defined in the [registry](https://github.com/openai/evals/tree/main/evals/registry/evals).

We wrap these evals in an Operator class for ease of use.

1. `OpenAIGradeScore`: Calls OpenAI evals with the given `eval_name`. Provide the columns corresponding to the input and the completion. [Documentation](https://uptrain-ai.github.io/uptrain/operators/language/OpenAIGradeScore/)

2. `ModelGradeScore`: Lets you define your own model-graded eval. You provide a custom prompt, the score assigned to each option, and a mapping that links dataset columns to the variables required in the prompt. [Documentation](https://uptrain-ai.github.io/uptrain/operators/language/ModelGradeScore/)
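
To make the column-to-variable mapping concrete, here is a minimal sketch of how such a mapping could be used to fill a prompt for one dataset row. It illustrates the idea only: `fill_prompt` is a hypothetical helper, the row is invented, and UpTrain's internal implementation may differ.

```python
def fill_prompt(template, context_vars, row):
    """Substitute prompt placeholders with values from one dataset row.

    `context_vars` maps placeholder name -> dataset column name.
    """
    values = {var: row[col] for var, col in context_vars.items()}
    return template.format(**values)


# A shortened template with the same placeholders as the custom prompt.
template = (
    "[Document]: {document}\n"
    "[Question]: {user_query}\n"
    "[Submitted answer]: {generated_answer}"
)

context_vars = {
    "document": "document_text",
    "user_query": "question",
    "generated_answer": "answer",
}

# An invented row with the dataset's column names.
row = {
    "question": "How do I cache data?",
    "document_text": "Caching is described in the docs.",
    "answer": "Use the caching feature.",
}

prompt = fill_prompt(template, context_vars, row)
print(prompt)
```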


```python
from uptrain.framework import Check
from uptrain.operators import ModelGradeScore, OpenAIGradeScore, Histogram

check = Check(
    name="Eval scores",
    operators=[
        ## Add your checks here
        # First is the openai grade eval
        OpenAIGradeScore(
            col_in_input="document_text",
            col_in_completion="answer",
            eval_name="coqa-closedqa-correct",
            col_out="openai_eval_score",
        ),
        # Second is our custom grading eval
        ModelGradeScore(
            grading_prompt_template="""
            You are an evidence-driven LLM that places high importance on supporting facts and references. You diligently verify claims and check for evidence within the document to ensure answers rely on reliable information and align with the documented evidence.
            
            You are comparing an answer pulled from the document for a given question
            Here is the data:
            [BEGIN DATA]
            ************
            [Document]: {document}
            ************
            [Question]: {user_query}
            ************
            [Submitted answer]: {generated_answer}
            ************
            [END DATA]
            
            Compare the factual content of the submitted answer with the document as well as evaluate how well it answers the given question. Ignore any differences in style, grammar, or punctuation.
            Answer the question by selecting one of the following options:
            
            (A) The submitted answer is a subset of the document and answers the question correctly.
            (B) The submitted answer is a subset of the document but is not an appropriate answer for the question.
            (C) The submitted answer is a superset of the document but is consistent with the document as well as answers the question correctly.
            (D) The submitted answer is a superset of the document but is consistent with the document although it does not answer the question correctly.
            (E) The submitted answer is a superset of the document and is not consistent with the document.
            """,
            eval_type="cot_classify",
            choice_strings=["A", "B", "C", "D", "E"],
            choice_scores={"A": 1.0, "B": 0.2, "C": 0.5, "D": 0.1, "E": 0.0},
            context_vars={
                "document": "document_text",
                "user_query": "question",
                "generated_answer": "answer",
            },
            col_out="custom_eval_score",
        ),
    ],
    plots=[Histogram(x="custom_eval_score", nbins=5)],
)
```
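
To see how the per-option scores turn a grader's reply into a number, consider the following sketch. It assumes the grader ends its chain-of-thought reply with the chosen letter on the last line; `extract_choice` and `score_outputs` are hypothetical helpers and the grader outputs are invented, so treat this as an illustration rather than the actual parsing logic.

```python
CHOICE_SCORES = {"A": 1.0, "B": 0.2, "C": 0.5, "D": 0.1, "E": 0.0}


def extract_choice(grader_output, choices):
    """Return the choice found on the last non-empty line, or None."""
    lines = [ln.strip() for ln in grader_output.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1].strip("().: ")  # tolerate "(E)" or "E."
    return last if last in choices else None


def score_outputs(outputs, choice_scores):
    """Map each grader output to its score; unparseable outputs get None."""
    scores = []
    for out in outputs:
        choice = extract_choice(out, choice_scores)
        scores.append(choice_scores[choice] if choice is not None else None)
    return scores


# Invented grader replies for illustration.
outputs = [
    "The answer restates the document.\nA",
    "It adds unsupported claims.\n(E)",
    "unclear",
]

scores = score_outputs(outputs, CHOICE_SCORES)
print(scores)  # → [1.0, 0.0, None]

valid = [s for s in scores if s is not None]
print(sum(valid) / len(valid))  # → 0.5
```

With the weights above, an answer that is correct but adds consistent extra material (option C) scores 0.5, which is why the histogram uses a handful of bins rather than a pass/fail split.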

Now that we have defined our evals, we will wrap them in a `CheckSet` object. `CheckSet` takes the source (i.e. test dataset file), the above-defined `Check` and the directory where we wish to save the results.


```python
from uptrain.framework import CheckSet, Settings
from uptrain.operators import JsonReader

# Our checkset has source as our dataset and checks as the above-defined check
check_set = CheckSet(
    source=JsonReader(fpath=dataset_path),
    checks=[check],
)
```

### Running the checks


```python
# We first define a settings object
settings = Settings(
    logs_folder="uptrain_logs",  # Results are saved in this folder
    openai_api_key="sk-**********************************",
)

# Run the checkset
check_set.setup(settings).run()
```
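
Rather than pasting the key into the notebook, you may prefer to read it from an environment variable. The helper below is a small sketch; `load_api_key` is a hypothetical name and `OPENAI_API_KEY` is an assumed variable name that you would set yourself.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read an API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the checks."
        )
    return key
```

You could then pass `openai_api_key=load_api_key()` to `Settings` instead of a hard-coded string.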

### [Optional] Visualize results on Streamlit


```python
from uptrain.dashboard import StreamlitRunner
runner = StreamlitRunner(settings.logs_folder)
runner.start()
```


## Using UpTrain's Managed Service

#### Step 1: Create an API Key

To get started, you will first need to get your API key from the [UpTrain website](https://uptrain.ai/dashboard).

1. Log in with Google
2. Click on "Create API Key"
3. Copy the API key and save it somewhere safe


```python
from uptrain import APIClient, Settings

UPTRAIN_API_KEY = "up-*******************************"   # Insert your UpTrain API key here

settings = Settings(
    uptrain_access_token=UPTRAIN_API_KEY,
)

client = APIClient(settings)
```

#### Step 2: Add dataset
Unlike the previous method where you had to create a dataset in Python, this method requires you to upload a file containing your dataset. The supported file formats are:

- .csv
- .json
- .jsonl

You can add the dataset file to the UpTrain platform using the `add_dataset` method.

To upload your dataset file, you will need to specify the following parameters:
- `name`: The name of your dataset
- `fpath`: The path to your dataset file

Let's say you have a dataset file called `qna-notebook-data.jsonl` in your current directory. You can upload it using the code below.


```python
client.add_dataset(name="qna-dataset", fpath=dataset_path)
```
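
A quick local check of the file extension against the supported formats listed above can catch mistakes before an upload is attempted. This is a sketch only; `is_supported_dataset` is a hypothetical helper, not part of the UpTrain client.

```python
from pathlib import Path

# Formats listed as supported in this guide.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".jsonl"}


def is_supported_dataset(fpath):
    """Return True if the file extension is one of the supported formats."""
    return Path(fpath).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_dataset("qna-notebook-data.jsonl"))  # → True
print(is_supported_dataset("qna-notebook-data.parquet"))  # → False
```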

#### Step 3: Add Checkset
A checkset contains the operators you wish to evaluate your model with.

You can add a checkset using the `add_checkset` method.

To add a checkset, you will need to specify the following parameters:
- `name`: The name of your checkset
- `checkset`: The checkset you wish to add
- `settings`: The settings you defined while creating the API client


```python
client.add_checkset(
    name="qna-checkset",
    checkset=check_set,
    settings=settings
)
```

#### Step 4: Add run
A run is a combination of a dataset and a checkset.

You can add a run using the `add_run` method.

To add a run, you will need to specify the following parameters:
- `dataset`: The name of the dataset you wish to run the checks on
- `checkset`: The name of the checkset you wish to run


```python
response = client.add_run(
    dataset="qna-dataset",
    checkset="qna-checkset"
)
```

#### Step 5: View the results
You can view the results of your evaluation by using the `get_run` method.


```python
client.get_run(response["run_id"])
```


You can also view the results on the [UpTrain Dashboard](https://demo.uptrain.ai/dashboard/) by entering your API key as a password.

![image](https://github.com/uptrain-ai/uptrain/assets/43818888/0dc6bb12-ab66-4bef-bfb0-31fab0f112f7)

